Energy-Efficiency Evaluation of FPGAs for Floating-Point Intensive Workloads
In this work we describe a method to measure the computing performance and energy efficiency that can be expected of an FPGA device. The motivation for this work is their possible use as accelerators for floating-point intensive HPC workloads. In the past, FPGA devices were not considered an efficient option for floating-point intensive computations, but more recently, with the advent of dedicated DSP units and the increased amount of resources per chip, interest in these devices has grown. Another obstacle to wide adoption of FPGAs in the HPC field has been the low-level hardware knowledge commonly required to program them using Hardware Description Languages (HDLs). This issue too has recently been mitigated by the introduction of higher-level programming frameworks adopting so-called High Level Synthesis approaches, which reduce development time and narrow the gap between the skills required to program FPGAs and those commonly held by HPC software developers. In this work we apply the proposed method to estimate the maximum floating-point performance and energy efficiency of the FPGA embedded in a Xilinx Zynq UltraScale+ MPSoC hosted on a Trenz board.
Porting a Lattice Boltzmann Simulation to FPGAs Using OmpSs
Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has attracted great interest from both academia and industry thanks to the possibility of greatly accelerating a variety of applications. This interest has been further boosted by recent developments in FPGA programming frameworks, which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches.
In this work we describe our first experiences in porting an HPC application to FPGAs; the application simulates the Rayleigh-Taylor instability of fluids with different density and temperature using Lattice Boltzmann Methods. This activity is carried out in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system at exa-scale level based on Arm processors and FPGAs. In this work we use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and can target FPGA devices as accelerators, as well as commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluating the programming effort required and assessing the preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC, which embeds 16nm FinFET+ programmable logic and a multi-core Arm CPU.
An FPGA-based Torus Communication Network
We describe the design and FPGA implementation of a 3D torus network (TNW) to provide nearest-neighbor communications between commodity multi-core processors. The aim of this project is to build tightly interconnected and scalable parallel systems for scientific computing. The design includes the VHDL code to implement, on recent FPGA devices, a network processor which can be accessed by the CPU through a PCIe interface and which controls the external PHYs of the physical links. Moreover, a Linux driver and a library implementing custom communication APIs are provided. The TNW has been successfully integrated in two recent parallel machine projects, QPACE and AuroraScience. We describe some details of the porting of the TNW for the AuroraScience system and report performance results.
Comment: 7 pages, 3 figures, proceedings of the XXVIII International Symposium on Lattice Field Theory, Lattice2010, June 14-19, 2010, Villasimius, Sardinia, Italy
Optimization of lattice Boltzmann simulations on heterogeneous computers
High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads between host and accelerator, and the optimal achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications
The Knights Landing (KNL) is the codename for the latest generation of Intel processors based on the Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism, and fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high - approximately 3 Tflops in double precision and 6 Tflops in single precision - but sustained performance depends critically on how well all parallel features of the processor are exploited by real-life applications. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid dynamics. In our OpenMP code we consider several memory data layouts that meet the conflicting computing requirements of distinct parts of the application, and sustain a large fraction of peak performance. We make some performance comparisons with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency.
Performance and portability of accelerated lattice Boltzmann applications with OpenACC
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only in specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives to mark regions of existing C, C++, or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
Computational Performances and Energy Efficiency Assessment for a Lattice Boltzmann Method on Intel KNL
In this paper we report results of an analysis of the computational performance and energy efficiency of a Lattice Boltzmann method (LBM) based application on the Intel KNL family of processors. In particular, we analyse the impact of the main memory (DRAM) when optimised memory access patterns are used to access data through the on-chip memory (MCDRAM) configured as a cache for the DRAM, even when the simulation data fits within the capacity of the on-chip memory available on the socket.
Massively parallel lattice–Boltzmann codes on large GPU clusters
This paper describes a massively parallel code for a state-of-the-art thermal lattice-Boltzmann method. Our code has been carefully optimized for performance on one GPU and for good scaling behavior when extended to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high performance processors. The outcome of our work is a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics.
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached.
Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics
Portable multi-node LQCD Monte Carlo simulations using OpenACC
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization across multiple computing nodes, using OpenACC to manage parallelism within the node and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to maximize performance, then describe selected relevant details of the code, and finally measure the level of performance and scaling performance that we are able to achieve. The work focuses mainly on GPUs, which offer a high level of performance for this application, but also compares with results measured on other processors.
Comment: 22 pages, 8 png figures